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Abstract 

Gaussian processes (GPs) are widely used in nonparametric regression, clas- 
sification and spatio-temporal modeling, motivated in part by a rich literature on 
theoretical properties. However, a well known drawback of GPs that limits their 
use is the expensive computation, typically 0(n 3 ) in performing the necessary 
matrix inversions with n denoting the number of data points. In large data sets, 
data storage and processing also lead to computational bottlenecks and numerical 
stability of the estimates and predicted values degrades with n. To address these 
problems, a rich variety of methods have been proposed, with recent options in- 
cluding predictive processes in spatial data analysis and subset of regressors in 
machine learning. The underlying idea in these approaches is to use a subset of the 
data, leading to questions of sensitivity to the subset and limitations in estimating 
fine scale structure in regions that are not well covered by the subset. Motivated 
by the literature on compressive sensing, we propose an alternative random pro- 
jection of all the data points onto a lower-dimensional subspace. We demonstrate 
the superiority of this approach from a theoretical perspective and through the use 
of simulated and real data examples. 

Some Keywords: Bayesian; Compressive Sensing; Dimension Reduction; Gaus- 
sian Processes; Random Projections; Subset Selection 

1 Introduction 

In many application areas we are interested in modeling an unknown function and 
predicting its values at unobserved locations. Gaussian processes are used routinely 



in these scena r ios, ex amples include modeling spatial random effects dBaneriee et al 



2004 ICressieL 1 1992b and supervised classification or prediction in machine learning 



( Rasmussenl 2004F Seegei , 2004 ). Gaussian processes are mathematically tractable 



have desirable properties and provide a probabilistic set-up facilitating statistical in- 
ference. When we have noisy observations y\ , . . . , y n from the unknown function 
/ : X — > observed at locations x\, . . . , x n respectively, let 

Vi = f{xi) + e l , fori = !,-■• ,n, (1) 
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where ej is the associated idiosyncratic noise. We let e$ ~ N(0,er 2 ) for simplicity. 
However, for the techniques we develop here, other noise distributions may be used, 
including heavy tailed ones. The unknown function /(•) is assumed to be a realization 
from a Gaussian process with mean function /i(-) and positive definite covariance ker- 
nel k(-, ■), so that E{f(x)} = /i(x) and cov{/(x), f(x')} = k(x, x') for all x, x' E X. 

The realizations of /(•) at the sample points x\, . . . ,x n have a multivariate Gaus- 
sian prior, with evaluations of the resulting posterior and computations involved in cal- 
culating predictive means and other summaries involving 0(n 3 ) computation unless 
the covariance has a special structure that can be exploited. Markov chain Monte Carlo 
algorithms for posterior computation allowing uncertainty in the residual variance a 1 
and unknown parameters in the mean function /i(-) and covariance fc(-) may require 
such computations at every one of a large number of iterations. Another concern is 
declining accuracy of the estimates as the dimension increases, as matrix inversion be- 
comes more unstable with the propagation of errors due to finite machine precision. 
This problem is more acute if the covariance matrix is nearly rank deficient, which is 
often the case when /(•) is considered at nearby points. 

The above problems necessitate approximation techniques. Most approaches ap- 
proximate /(■) with another process g(-) that is constraine d to a reduced ra nk subspace. 
One popular strategy specifies g ( • ) as a kernel convolution dHigdon , 2002), with related 
approaches instead relying on other bases such as low rank splines or mov ing averages 
dWikle & Cressielll999l:lxia & Gelfandl l2006t iKammann & Wandll2003l) . A concern 
with these approaches is the choice of basis. There ar e also restrictions on t he class 
of covariance kernels admitting such representations. iBanerjee et al. ( 20081) instead 
proposed a predictive process method that imputes /(•) condit ionally on the values 
at a finite number of knots, with a similar method proposed by iTokdar j ([20071) for lo- 
gistic Gaussian processes. Subset of regressors dSmola & Bartleti 1200 ll) is a closely 
related method to the predictive process that was proposed in the machine learning 
literature and essentially ignored in stat istics. Both of the se approaches substantially 
underestimate predictive var iance, with Finlev et all ( 2009) proposing a bias correction 
in the statistics literature and lSnelson & Ghahrama ni (2006) independently developing 
an essentially identical approach in machine learning. Alter native methods to a djust 
for underestimation of predic tive variance were proposed in ISeeger et alj d2003l) and 



Schwaighofer & Trespl (12003 ). 

~~ nT( 



Qui & Rasmusse nT (l2005h proposed a unifying framework that encompasses essen- 



tially all of these subset of regressors-type approximation techniques, showing that they 
can be viewed as an approximation to the prior on the unknown function, rather than 
its posterior. While these methods do not require choice of a basis, an equally difficult 
problem arises in determini ng the location and spacing of knots, with the choice hav- 
ing a substantial impact. In|Tpkdar (2007) in the context of density estimation and in 
unpublished work by Guhaniyogi, Finley, Banerjee and Gelfand in the context of spa- 
tial regression, methods are proposed for allowing uncertain numbers and locations of 
knots in the predictive process using reversible jump and preferential sampling. Unfor- 
tunately, such free knot methods increase the computational burden substantially, par- 
tially eliminating the computational savings due to a low rank method. In the machine 
learning literature, various optimization methods have been proposed for knot selec- 
tion, typically under the assumption that the knots correspond to a subset of the data 
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points. Such methods include online le arning dCsato & Oppen,l2002l) . gre edy posterior 
maxi mization dSmola & Bartlett, 2001), maximum information criterion dSeeger et all 
120031) . and matching pursuit (IKeerthi & C hu. 2006) among others. 

In this article, we propose a new type of approximation method that bypasses 
the discrete knot selection problem using random projections. The methodology is 
straightforward to implement in practice, has a theoretical justification and provides a 
natural generalizati on of knot-based m ethods, with pivoted factorizations and the in- 
tuitive alg orithm of Finlev et al. (2009) arising as special cases. Motivated bv lSarlos 
(20061 and lHalko etal.ld2009b we use generalized matrix factorizations to improve nu- 
merical stability of the estimates, a problem which is typically glossed over. The inspi- 
ration for our method arises out of the success of random projection techniques, such as 
compressed sensing dCandes et all 2006; D onoho , 2006), in a rich variety of contexts 
in machine learning. Most of this literature focuses on the ability to reconstruct a signal 
from compressive measurements, with theoretical guarantees provided on the accuracy 
of a point estimate under sparsity assumptions. In contrast, our goal is to accurately 
approximate the posterior distribution for the unknown function in a fundamentally 
different setting. We also explore how these approximations affect inference on the co- 
variance kernel parameters controlling smoothness of the function, an issue essentially 
ignored in earlier articles. Our theory suggests that predictive process-type approxima- 
tions may lead to high correlations between the imputed process and parameters, while 
our method overcomes this problem. 



2 Random Projection Approximation Methodology 

2.1 Predictive Processes and Subset of Regressors 

As a first step, we place the predictive process and subset of regressors methods un- 
der a common umbrella. Consider equation (Q~|), with /i = for notational clarity, 
and let X* = {x*, . . . ,x* m } denote a set of knots in X. Letting /* = f(X*) = 
{f(x*) t . . . , /(^m)} T denote the function /(•) evaluated at the knots, the predictive 
process replaces /(•) by g( - ) = E{ /(•)!/*}, with g(-) a kriged surface in spatial 
statistics terminology dSteinLll999l) . It follows from standard multivariate normal the- 
ory that for any x G X,g{x) = (k x ^) T (K*^)^ 1 f* , where fc X) * is the m x 1 vector 
{k(x, x*), k(x, xl), . . . , k(x, Xm)} T and if*,* is the m x m matrix with k(x*, x*) in 
element 

Subset of regressors is instead obtained via an approximation to Kfj= cov{f(X}). 
Letting 



ifau S =COv[{/(X) T ,(f f} 



/ , r ;■' < <' I K fJ K f,* 



an optimal (in a sense to be described later) approximation to Ktj is obtained as 
Qjj = K f^(K*^)~ 1 K if j, with Qij — K i ^(K if ^)~ 1 K if ^ denoting cell of 
Q/,/.This approximation Qfj is equivalentto cov{g(X)} obtained from the predictive 
process approximation, a nd hence the two approaches are equivalent. As shown in 
lOui & Rasmussenl d2005l) . <?(•) is effectively drawn from a Gaussian process with the 
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degenerate covariance kernel 



qsonix, z) = (fc x ,*) T (-^*,*) 1 k*, z , 

where k* >z = {k(x*, z), . . . , k(x* ml z)} T . From equation (Q~|i, we obtain Y = (yi, . . . , y n ) T 
N(0; a 2 I + Kfj) in marginalizing out / over the exact prior / ~ GP(0, k). If we use 
the approximated version, we have 

yi=g(xi) + ei, ei -N(0,cr 2 ). (2) 

Marginalizing out g, we obtain Y ~ N(0, a 2 1 + Qfj)- 

Let Y Q denote the vector of length n a of observed values and Y p the vector of 
length n p of values to predict, with n a + n p = n. Under the above approximation, 
the conditional predictive distribution is (Y^|Yo) ~ N{Q Pt0 (Q 0y0 + <r 2 I)~ l Y , Q 0j0 — 
] P ,o{Qo,o + o- 2 /)^ 1 Q ,p}, wit h Qo.q, Q„ „, Q v o denoting submatrices of Q f j. Using 



the Woodbury matrix identity dHarvillel l2008h yields (Q , + a 2 !)- 1 = o~ 2 {I 



K *(<j 2 K*.i, + K*. K 0Jr ) 1 K Sft0 ], with calculation involving an to x m matrix. 



Finlev et al. (2009) show that the predictive process systematically underestimates 
variance, since at any x G X, var{/(x)} — var{g(x)} = var{/(a;) | /*} > 0. To adjust 
for this underestimation, they replace g(-)by g(-) + e g (-), with e g (x) ~ N{0, k(x, x) — 
an d c °v{tg{xi), £ g (x2)} = for xi 7^ X2- Hence, in place of 
equation (fJJ, we have 

Vi = g{xi) + e g (xi) + ei, ei~N(0,cr 2 ). 

A variety of methods for addressing the variance under-es timation problem were in - 
dependently developed in the machine learning literature (Qui & Rasmussen, 2005), 



with th e fully independe nt training conditional approximation cor responding exactly 
to the iFinley et al.1 (120091) approach. iSnelson & Ghahramanil (120061) also proposed this 
approach under the sparse Gaussian process with pseudo inputs moniker. In each of 
these cases, gm (■) = <?(•) + e 9 (-) is effectively drawn from a Gaussian process with 
the degenerate covariance kernel 

qFirc(x, z) = qsoB.(x, z) + S(x, z){k(x, z) - q S OB.(x, z)}, 

where S(x, z) = 1 if x = z and otherwise. Our proposed random projection method 
will generalize these knot-based approaches, leading to some substantial practical ad- 
vantages. 



2.2 Generalization: Random Projection Method 

The key idea for random projection approximation is to use gnp(-) = E{f(-)\$f(X)} 
instead of g(-) — E{f(-)\f*}, where $ is some m x n matrix. The approximation 
9rp{ ) is drawn from a Gaussian process with covariance kernel, 

q RP (x,z) = {$k x j) T ($K fJ $ T )- 1 $k f , z , 

where k x j — {k(x, x\), . . . , k(x, x n )} T and kf iZ — {fc(xi, z), . . . , k{x n , z)} T . As 
in the methods of §2 • 1, we face the variance under-estimation issue with var{/(x)} — 



4 



var{gj>p( x)| = var{/ ( x) | <& f(X)} > 0. Following the same strategy for bias correc- 
tion as in iFinlev et al.1 d2009l) . we let g R M (•) denote the modified random projection 
approximation having covariance kernel 

q RM (x, z) = q RP {x, z) + S(x, z){k(x, z) - q RP (x, z)}. (3) 

When 3> is the submatrix formed by m rows of a permutation matrix of order n 



(Harville, 2008), we revert back to the formulation of §2 • 1, where the knots are an 
m dimensional subset of the set of data locations X. We consider more generally 
$ € C, the class of random matrices with full row-rank and with row-norm = 1 to 
avoid arbitrary scale problems. Before discussing construction of $, we consider some 
properties of the random projection approach. 

2.3 Properties of the RP method 

(1) Limiting Case: When m = n, $ is a square non-singular matrix. Therefore, 
($Kfj<b T )~ 1 = ($ T ) _1 if7i$ _1 , so that Qfj = Kfj, and we get back the origi- 
nal process with a full rank random projection. 

(2) Opti mality in ter ms of Hilbert space projections: It is well known in the theory 



of kriging ( Stein, 1999) that taking conditional expectation gives the orthogonal pro 



jection into the corresponding space of random variables. Let W{f(X), <£>} denote the 
Hilbert space spanned by linear combinations of the m random variables and 
equipped with the inner product (/i,/ 2 ) = E(f 1 f 2 ) for any f 1 J 2 <E H{f(X), $}. 
The orthogonal projection of / to the Hilbert space is f opt — argmin^^sfrx *} ||/ — 
From kriging theory f° pt {x) = (4>k xJ ) T ' (K fJ )- l $f(X) = E{f(x)\$f(X)}. 
Hence, the random projection approximation is optimal in this sense. As f opt is a 
function of $ G C, the best possible random projection approximation to / could be 
obtained by choosing $ to minimize \\f opt — f\\. As the predictive process-type ap- 
proaches in §2 • 1 instead restrict $ to a subset of C, the best possible approximation 
under such approaches is never better than that for the random projection. While find- 
ing the best $ is not feasible computationally, §3 proposes a stochastic search algorithm 
that yields approximations that achieve any desired accuracy level with minimal addi- 
tional computational complexity. 

(3) Relationship with partial matrix decompositions: We briefly discussed in §2 • 
1 that the approximations in the machine learning literature were viewed as reduced 
rank approximations to the covariance matrices. Here we make an explicit connection 
between matrix approximation and our rand om projection scheme, whic h we build on 
in the next section. The Nystrom scheme ( Drineas & Mahoneyl 2005 ) considers the 



rank m approximations to nxn positive semidefinite matrix A using mx n matrix B, 
by giving an approximate generalized Choleski decomposition of A as CC T , where 
C = (BA) T (BAB T )~ 1 ^ 2 . The performance of the Nystom scheme depends on how 
well the range of B approximates the range of A. As in property (1), let Qf P j be the 
random projection approximation to Kfj. It is easy to see that Qf P corresponds to a 

( fJ , with C = i'l'A, Mh/w,''' 7 



Nystrom approximation to Kfj, with C = (<&K f j) T \<&K fj$ T ) 1 ^ 2 - 
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The Nystrom characterization allows us to obtain a reduced singular value decom- 
position utilizing the positive definite property as considered in det ail in §3. The partia l 
Choleski decompositions for the covariance matrices, advocated in lFoster et al. N2009I) 
for approaches in §2 • 1, arise as special cases of the Nystrom scheme using permutation 
submatrices; arguing on the lines of property (2), best case accuracy with the random 
projection is at least as good as the partial Choleski decomposition. We later show 
empirically random projection performs substantially better. 

(4) Relationship with truncated series expansions: The random projection approxi- 
mation also arises from a finite basis approxim ation to the stochastic process /. Under 



the Karhunen-Loeve expansion dAdle r, 1990) 



z 



f(x)=J2m(\) 1/2 ei(x), x&X, 
where X is compact and A; , e, are eigenvalues and eigenvectors, respectiv ely, of the 



covariance function k, given by the Fredholm equation of the second kind as dGrigoriu , 
20021) . 

k{x\,x)ei{x)dx — \iei{xi), x G X 



x 



r/i's are independent N(0, 1) random variables by virtue of properties of the Gaussian 
process. Using Mercer's theorem, which is generalization of the spe ctral theorem fo r 



positive definite matrices, we can express the covariance function as (Grigoriu, 2002) 



k(xi,x 2 ) = ^2 Xie l {xi)e i {x2), xi,x 2 G X. 

i=l 

Assume that the eigenvalues in each of the above expansions are in descending order. 
Let ftr(x) = X)I=i Vi(^i) e i{ x ) be the approximation to fix) obtained by finitely 
truncating the Karhunen-Loeve expansion, keeping only the m largest eigenvalues. The 
covariance function for f tr is given by k tr {xi,X2) — 2~Z"=i ^i e i( x i) e i( x 2)i x\,x 2 G 
X, which is, as expected, a corresponding truncation of the expression in Mercer's the- 
orem. If we now evaluate the truncated covariance function on the set of points of 
interest, X, we get the covariance matrix, K tr = EAE T , where E is the nxm matrix 
with (i,j) th element given by ej(Xi) and A is a m x m diagonal matrix with the m 
eigenvalues in its diagonal. The Karhunen Loeve expansion considers orthogonal func- 
tions so that f x ei(x)ej(x)dx = whenever i ^ j. If we use the quadrature rule with 
equal weights for approximation of the integral with the n locations of interest, we 
have XT=i e i( x i) e j( x i) = 0> which means that the matrix E is approximately row- 
orthogonal. Assuming that E is exactly orthogonal the truncated Mercer expansion 
matrix K tr is essentially a reduced rank m spectral decomposition for the actual covari- 
ance matrix. The covariance matrix of the random vector gRp(X) is equal to the rank 
m spectral decomposition when we choose the projection matrix $ equal to the first m 
eigenvectors of the actual covariance matrix, as shown in the next section. Therefore 
gnp(X) has the same probability distribution as f tr {X). In other cases, when $ ^ the 
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eigenvectors, as in approaches in section §2 • 1, its easy to show that the random pro- 
jection corresponds to some other truncated basis expansion in the same way as above. 
The Karhunen Loeve is however the optimal expansion in the sense that for each to, for 
any other h tr (-) from some m truncated basi s expansion, fyE\jf(x) — ht r (x)} 2 ]dx 
is minimized over \i± T 

(■), for h tr {-) = ftr(-) dGhanem & Spanosll2003h . 

3 Matrix Approximations & Projection Construction 
3.1 Reduced rank matrix approximations 

We introduce stochastic matrix approximation techniques that enable us to calculate 
nearly optimal projections. We start with some key concepts from linear algebra. Let 
|| ■ || 2 and || ■ \\p denote the spectral and Frobenius norm for matrices and let K be 
any n x n positive definite matrix. We focus entirely on positive definite matrices. 
A spectral decomposition of K is given by, K — UDU T , where D is a diagonal 
matrix whose diagonal elements are the eigenvalues. Since K is positive definite, this 
is equivalent to the singular value decomposition and the eigenvalues are equal to the 
singular-values and > 0. U is an orthonormal matrix whose columns are eigenvectors 
of K. Consider any n x n permutation matrix P, and since PP T = I we have, 

UDU T = UPP T DPP T U T = (UP)(PDP T )(UP) T . 

Therefore any permutation of the singular values and their respective vectors leads 
to an equivalent spectral decomposition for K, and it can be shown that the spectral 
decomposition is unique up to permutations. Henceforth we shall consider only the 
unique spectral decomposition in which the diagonal elements of D are ordered in 
increasing order of magnitude, d u > c? 2 2 ■ ■ ■ > d nn . Consider the following partition 
for the spectral decomposition, 

K = [U m ?7( n _ TO )] 

where D mm is the diagonal matrix containing the to largest eigenvalues of K and U m 
is the n x m m atrix of corresp onding eigenvectors. Then it follows from the Eckart- 
Young theorem dStewarll. ll 993) that the best rank m approximation to K is given by 
Km = UmDmmU^, in terms of both || ■ H2 and || ■ \\p . In fact it can be shown that 

\\K-Km\\ 2 F = E?= m+ l€- 

Recall that the crux of our random projection scheme was replacing the covariance 
matrix K by (^K) T (^K^ T )~ 1 (^K), where $ is our random projection matrix. Now 
if we choose $ = then, 

= {Uj n Kf(U T m KU m y x {UlK) 

= {(^ mm 0)i7 T } T (^ mm )^ 1 {(£'mm0)[/ T } = K m , 

where above is an m x (n — m) matrix of zeroes. Therefore the best approximation 
in our scheme is obtained when we have the first m eigenvectors of the SVD forming 
our random projection matrix. 



P^m.vn, 



'mm 

D< 



[U m ?7( n - m )]" 
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The problem however is that obtaining the spectral decomposition is as burdensome 
as computing the matrix inverse, with 0(n 3 ) computations involved. Recent articles in 
machine learning in the field of matrix approximation and matrix completion have de- 
vised random appro ximation schemes which give n ear optimal performance with lesser 



computational cost dHalko et all l2009t : ISarlosL 120061) . We can consider these stochastic 



schemes to address either (i) Given a fixed rank m, what is the near optimal projection 
for that rank and what is the corresponding error; or (ii) Given a fixed accuracy level 
1 — e, what is the near optimal rank for which we can achieve this and the corresponding 
projection. We consider each of these questions below. 

We first address the fixed rank problem. For any matrix K of order n x n and 
a random vector io of order n X 1, Ku> is a vector in the range of K. For an n x r 
random matrix £1 with independent entries from some continuous distribution, KD, 
gives r independent vectors in the range of K with probability 1. There can be at most 
n such independent vectors, since the dimension of the range = n. As we mentioned 
earlier, when we evaluate the Gaussian process at a fine grid of points, the covariance 
matrix K is often severely rank deficient and we should be able to accurately capture 
its range with m « n vectors. 

The next question is how to choose the random matrix VL. The product KVl em- 
beds the matrix K from a TZ nxn space into a ~JZ nxr space. Embeddings with low 
distortion properties have been well studied a nd Johnson-Lindenstrauss transforms 



( John son et ; 



roperties have been well studied and Johnson-Lindenstrauss transrorms 
ifl[l986; iDasgupta & Guptal 12003b are among the most popular low di- 



mensional projections. A matrix of order n x r is said to be a Johnson-Lindenstrauss 
transform for a subspace V of lZ n if | — \\v\\ | is small for all v 6 V with high 
probab ility. For the p recise definition of the transform, we refer the readers to Defini- 
tion 1. !Sarlosl (120061) . Initially it was shown that with (i,j) th element = (^Wy), 
where wy independent ~ N(0, 1) would have Johnson-Lindenstrauss property. Later 
it has been shown that cjjj 's may be considered to be independent Ra demacher or com- 



ing from a uniform distribu tion from the corresponding hypersphere dAchlioptas , 2003 



Arriaga & Vempalai 12006b . The compressive sensing literature has dealt with these 



choices in some detail and has found no substan tial gain in accuracy in signal com 



pression in using one kind over the other (ICandes et al. , 2006; iDonohol 120061) - our 
experiments in the present context concur. 

Having formed Kil the concluding step in our matrix approximation scheme is 
to find $. We first perform a low distortion low dimensional Johnson-Lindenstrauss 
embedding for the covariance matrix and perform the rank m projection for this em- 
bedding to come up with $. It is easy to then calculate the approximate spectral de- 
composition of the covariance based on the Nystrom approximation for the random 
proje ction. The ex act steps are shown below in Algorith m Q] which combines ideas 
from ISarlosI J2006h and algorithm 5.5 in Hal ko et al. (2009t). 



We give the following result f or the approxim ation accuracy of AlgorithmQ] which 



is a modification of theorem 14 in Sarlo; 



approxin 
^sl i2006h 



Theorem 1. Consider any < e < 1 and r — |_— J. Obtain K tr from Algorithm\l\for 
the positive definite matrix K and let K m be the best rank m approximation for K if 
terms of \\ ■ \\p. Then, 
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Algorithm 1 Approximate spectral decomposition via Nystrom method for target rank 

m 

Given a positive definite matrix K of order n x n and a randomly generated Johnson- 
Lindenstrauss matrix Q of order r x n, we find the projection matrix $ of order to x n 
which approximates the range and compute the approximate SVD decomposition via 
Nystrom approximation with $. 

1. Form the matrix product Kfl. 

2. Compute $ T = left factor of the rank m spectral projection of the small matrix Ktt. 

3. Form K x = $if$ T . 

4. Perform a Choleski factorization of K\ = BB T . 

5. Calculate the Nystrom factor C = K$ T (B T )- 1 . 

6. Compute a spectral decomposition for C = U DV T . 

7. Calculate the approximate spectral decomposition for K « K tr — UD 2 U T . 



pr{\\K-K tr \\ < (l + e)\\K -K rn \\ F } > i 

With the advances in parallel computing technology and current stress on GPU 
computing, we may implement a parallel version of Algorithm Q] by running steps 

1 & 2 in parallel for several copies of the matrix fi; with log(jj) copies, we can sharpen 
the probability in theorem Q] to 1 — T], In our implementations of algorithm Q] we use 
r = to. The algorithm involves decomposition of the small matrix QK& T which 
involves 0(to 3 ) operations. The matrix multiplications involved, for example in com- 
puting K\ are 0(n 2 m), which is the additional cost we pay to have the random pro- 
jection generalization of the algorithms in §2 • 1. Matrix multiplication can be done 
in parallel, indeed it is the default approach in standard linear algebra packages such 
as BLAS3 used in Matlab versions 8 and above, and the constants associated with the 
order of complexity for matrix multiplication is lower than that for inversion. Our re- 
sults section indicate that added computational complexity in terms of real CPU time 
is indeed negligible for the random projection algorithm versus techniques in §2 • 1. In 
fact with the target error algorithm below, we often achieve lower times than predictive 
process type approaches of §2 • 1, since the rank required to achieve the target error is 
substantially smaller. 

We now answer the fixed accuracy level question. The eigenvector matrix U from 
the SVD captures the column space/range of the matrix K, in the sense that K — 
UU T K. In general we consider the error in range approximation \\K — $ T &K\\ V (?/ = 

2 or F), as it makes it easier to evaluate the target accuracy. Using simple linear algebra, 
UmU^K — K m , so that the best rank to range approximator is the same as the rank 
to SVD app roximation. It suffice s to the n search for good range a pproximator s , since 
lemma 4 in iDrineas & M ahonev (2005) and discussion in §5.4. iHalko et alj (2009) 



show that the error with the Nystrom approximator is at least as small as the error in 
range approximation, and empirically is often substantially smaller. We need only find 
the projection matrix $ for the range approximation given the target error level and 
computation of the approximate spectral decomposition using this $ proceeds as in 
steps 3 — 7 of AlgorithmQ] $ can be obtained to satisfy any target error level by trivial 
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modification of steps from algorithm 4.2 in iHalko et a l. (2009) in place of steps 1 & 2 
in AlgorithmQ] summarized below in Algorithm[2] 



Algorithm 2 Finding range satisfying target error condition 

Given a positive definite matrix K of order n x n and target error e > , we find the 
projection matrix $ of order m x n which gives \\K — $> T QK \\ < e with probability 
1 - -2- 

10 r- 

1. Initialize j = and $ = [], the Oxn empty matrix. 

2. Draw r random vectors a/ 1 ) , . . . , uj^ each of order n x 1 with independent entries 
fromN(0,l). 

3. Compute «W = Ku>^' for i = 1, . . . , r. 

4. Is max l= i,... :r (||K^ +l )||) < ? If yes, step 11. If no, step 5. 

5. Recompute j = j + 1 , = [I - {$(J- 1 )} t $0'- 1 )]k;« and ^ = JjgL . 
$0'-i) 

7. Draw arixl random vector u : > +r with independent N(0, 1) entries. 

8. Compute = [J _ {$W } T $0')]if w O+''). 

9. Recompute re« = «W - ^0') , K W) for i = (j + 1), . . . , (j + r - 1). 

10. Back to target error check in step 4. 

11. Output $ = 



6. Set $0) = 



Step 9 above is not essential, it ensures better stability when k vectors become very 
small. In our implementations of algorithm |2] we use an r such that yjp = 0.1 to 
maintain probability of 0.9 of achieving the error level. The computational require- 
ments of Algorithm [ 2] are similar to that of Q] for more details we refer the reader to 
<)4 • 4 in Halko et alj ([2009). Posterior fit and prediction in Gaussian process regression 
usually involves integrating out the Gaussian process, as indicated in §4. We end this 
subsection with another result which shows that target error in prior covariance matrix 
approximation governs the error in the marginal distribution of the data, integrating out 
the Gaussian process. 

Theorem 2. Let Y = (yi, 7/2, ■ • ■ , Un) T be the observed data points and let irf u u = 
J ir{YJ(X)}dP{f(X)}, ttrp = J ir{Y,g R p(X)}dP{g RP (X)} their correspond- 
ing marginal distributions. If\\Kfj — Q^\\p < e, which is the error in approximation 
of the covariance matrix, then the Kullback Leibler divergence between the marginal 
distributions from the full and approximated Gaussian process, 

KL(tt f u u, ttrp) < |n+ (^j je 

3.2 Conditioning numbers and examples 

The full covariance matrix for a smooth Gaussian process tracked at a dense set of 
locations will be ill-conditioned and nearly rank-deficient in practice, with propagation 
of rounding off errors due to finite precision arithmetic, the inverses may be highly 
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unstable and severely degrade the quality of the inference. To see this, consider the 
simple example with covariance function k(x,y) = e~°' 5 ( x ~ v ) , evaluated at the points 
0.1, 0.2, which gives the covariance matrix, 



K 

which yields the inverse, 



1.000 0.995 
0.995 1.000 



100.5008 -99.996 
-99.996 100.5008 



Perturbing the covariance kernel slightly to k(x, y) = e °- 75 ( x v) > yields a very sim- 
ilar covariance matrix, 



1.0000 0.9925 
0.9925 1.0000 I ' 



with \\K— K new \\p = 0.0035. However the inverse of the covariance matrix drastically 
changes to 



K~ l = 

new 



67.1679 -66.6660 
-66.6660 67.1679 



with \\K~ l — K~£ w \\f = 66.6665. With such a small change in the magnitude of 
some elements, we have a huge change in its inverse, which would lead to widely 
different estimates and predicted values. The problem is obviously much aggravated in 
large data sets and in Bayesian settings where there the posterior is explored through 
several rounds of iterations, say in an Gibbs sampling scheme. How well a covariance 
matrix K is conditioned may be measured by the conditi oning number , 2L , where 



<ti,<j s are its largest and smallest eigenvalues respectively (Dixon, 1983). Condition 
numbers are best when they are close to 1, very large ones indicate numerical instability 
- in the example above, the condition number of the matrix K is « 400. Condition 
number arguments imply that low rank approximations may not only be necessitated by 
computational considerations but may indeed be desirable for better inference over the 
full covariance matrix. It therefore makes practical sense to choose amongst two low 
rank approximations of comparable rank or accuracy, the one that is better conditioned. 
We now show empirically how condition number is improved greatly with the random 
projection approximation over the knot based schemes, when considering either a fixed 
rank or target error approach. 

We first evaluate with respect to the fixed rank question. Consider a similar covari- 
ance kernel as above k(x, y) = e~( x ~ y ^ , and evaluate it over a uniform grid of 1000 
points in [0.1, 100], and consider the resulting 1000 x 1000 covariance matrix K. The 
condition number of K w 1 .06 5 2 x 10 20 , which indicates it is severely ill-conditioned. 
We now apply AlgorithmQ] with r = m, for different choices of the target rank m and 
calculate the error in terms of the Frobenius and spectral norms, conditioning numbers 
and the time required. For each choice of m, we also consider the approximation as 
would given by the approaches of §2 • 1 in two ways, (1) randomly selecting m grid 
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points out of the 1000, call this PP1; and (2) selecting the grid points by the partial 
Choleski factorization with pivoting, as in iTokdai (2011), call this PP2, which can be 



interpreted as a systematic implementation of the suggested approach in Finl ev et al 



(2009). Results are summarized in table[T]for some values of m. The random projec- 
tion approach clearly has better approximation accuracy than the other methods - this 
becomes more marked with increase in dimension of the approximation. The condi- 
tion numbers for the random projection scheme are dramatically better than the other 
2 approaches, indicating superior numerical stability and reliable estimates. 

Next we compare with respect to achievement of a target error level. For the ran- 
dom projection approach, we implement Algorithm [2] For this comparison it would 
be useful to know the best possible rank at which the target error would be achieved if 
we knew the real spectral decomposition. For this purpose we consider matrices of the 
form K = EDE T , where E is an orthonormal matrix and D is diagonal. The diago- 
nal elements of D, which are the eigenvalues of K are chosen to decay at e xponential 



rates, which holds for smooth covariance kernels dFrauenfelder et al.L 120051) . with i th 
element da = e~ lX ; for the simulations tabulated, we use A = 0.5, 0.08, 0.04 respec- 
tively. E is filled with independent standard normal entries and then orthonormalized. 
Algorithm |2] for random projections, PP1 and PP2 as above are applied to achieve dif- 
ferent Frobenius norm error levels e for different values of matrix order n. Results 
are shown in table [2] Clearly random projection achieves the desired target error level 
with lower ranks for all different values of e and n; also real CPU times required are 
comparable, in fact the random projection approach has lower time requirements when 
the rank differences become significant. 

Lower target ranks, besides the obvious advantages of computational efficiency and 
stability, imply lesser memory requirements, which is an important consideration when 
sample size n becomes very large. Time required for matrix norm calculations for 
checking target error condition for PP1 or PP2 are not counted in the times shown. All 
times here as well as in following sections, are in seconds and calculated when running 
the algorithms in Matlab 7.10 version R2010a on a 64bit CentOS 5.5 Linux machine 
with a 3.33 Ghz dual core processor with 8Gb of random access memory. The random 
projection benefits from the default parallel implementation of matrix multiplication 
in Matlab. Lower level implementations of the algorithms, for example C/C++ im- 
plementations would require parallel matrix multiplication implementation to achieve 
similar times. With a GPU implementation with parallel matrix multiplication, random 
projection approximation can be significantly speeded up. 



4 Parameter Estimation And Illustrations 

4.1 Bayesian inference for the parameters 

An important part of implementing Gaussian process regression is estimation of the 
unknown parameters of the covariance kernel of the process. Typically the covari- 
ance kernel is governed by 2 parameters, characterizing its range and scale. We shall 
consider the squared exponential kernel used earlier, k(x, y) = j^e~ ei for sim- 

plicity, but the techniques herein shall be more generally applicable. 9\ and O2 are the 
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range and inverse scale parameters respectively. We shall use Bayesian techniques for 
inference here to fully explore the posterior over all possible values of these param- 
eters, also applying the random projection scheme for repeated iterations of Markov 
chain samplers will allow us to fully demonstrate its power. 

For Bayesian inference, we have to specify prior distributions for each of the un- 
known parameters, namely 9i,&2 and a 2 , the variance of the idiosyncratic noise in 
equation ([T). In place of ([T), using the random projection, we have, 

Ui = 9RM{xi) + ei, i = 1, . . . , n. (4) 

Using the bias corrected form for the random projection approximation the prior for the 
unknown function is, [g RM {X)\d 1 ,0 2 } ~ N(Q,Qf*f), where Qff = Qf p } + D M , 
with Dm the diagonal matrix as obtained for variance augmentation from equation 
©. Letting t = a~ 2 and choosing conjugate priors, we let r ~ Ga(ai,6i), 02 ~ 
Ga(a2,&2) and 0\ ~ Ylh=iO-/t)^ct> denoting a discrete uniform distribution with 
atoms {ci, . . . , Ct}- The Ga(a, b) gamma density is parametrized to have mean a/b 
and variance a/b 2 . The priors being conditionally conjugate, we can easily derive the 
full conditional distributions necessary to implement a Gibbs sampling scheme for the 
quantities of interest as follows, 

[9rm(x)\-\ ~ miQfj)- 1 + nr'Y, mfj)- 1 + ny 1 ] 

[t\-] ~ Ga[ ai + |, bx + {Y - g RM (X)} T {Y - g RM (X)}} 
[9 2 \-] ~ Ga(6 2 + / T g- V) 

pr(0! = Ci|— ) = cjdetQ^f |-4 e -i»*«W !r W/./ r )-WW 

where Q = 62Q ] f 1 f and c is a constant such that 5Z* =1 Prob(^i = Cj|— ) = 1. 
We can integrate out the Gaussian process g R M{X) from the model to obtain Y ~ 
N(0, {Q/ / f + t - this form is useful for prediction and fitting. We show some 
relevant computational details for the matrix inversion using the Woodbury matrix iden- 
tity in the appendix. 

For computational efficiency, we pre-compute the random projection matrix for 
each of the discrete grid points for 6\ and the corresponding matrix inverse required for 
the other simulations. Changes in the parameter 6 2 do not affect the eigendirections, 
hence we do not recompute the projection matrix $ and we can compute the new 
inverse matrix due to a change in 62 by just multiplying with the appropriate scalar. 
Although other prior specifications are extensively discussed in the literature, we have 
considered simple cases to illustrate the efficacy of our technique. It is observed that 
inference for the range parameter B\ is difficult and Markov chain Monte Carlo schemes 
tend to have slow mixing due to high correlation between the imputed functional values 
and the parameter. The random projection approximation appears to take care of this 
issue in the examples considered here. 

4.2 Illustrations 

We first consider a simulated data example where we generate data from functions 
corresponding to a mixture of Gaussian kernels in [0,1]. We consider functions with 
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3 different degrees of smoothness - an almost fiat one, a moderately wavy one and a 
highly wavy one. For each of these functions, we consider 10, 000 equi-spaced points 
in [0, 1] and we add random Gaussian noise to each point - this constitutes our ob- 
served data set Y . We randomly select 9, 000 points for model fitting and the rest for 
validation. We now implement random projection with Algorithm [2] with a couple of 
different target error levels (0.1,0.01) referred to as RP. We compare it with predic- 
tive process with equispaced selection of k nots and with t he modified version of knot 
selection by pivoted Choleski factorization (Tokdar, 201 lh explained in §3.2, referred 
to as PP1 and PP2 respectively. In this simulated example, as well as in the real data 
examples we use the squared exponential covariance kernel with prior specifications 
as in the previous section. For the idiosyncratic noise, we use hyperparameters a\ , b\ 
such that the mean is approximately equal to estimated noise precision with ordinary 
least square regression. In particular for the smooth one we use a\ — \,b\ — 10. Hy- 
perparameter choices for covariance kernel parameters are guided by some trial runs, 
we use a grid of 2000 equispaced points in [0, 2] for Q\ and a-i — 2, 62 = 20 for 62- 
We run Gibbs samplers for 10, 000 iterations with the first 500 discarded for burn-in. 
We calculate the predicted values for the held-out set with the posterior means of the 
parameters from the Gibbs iterations and we also calculate the average rank required 
to achieve the target accuracy over the iterations. Effective sample size is calculated 
by using the output for the Markov chains with the CODA package in R. The results 
are tabulated in table [3] whereby random projection has substantial gain in predictive 
accuracy and in the target rank required, as well as substantially better effective sample 
sizes for the unknown parameters of the covariance kernel as well as for the predicted 
points. With the predictive process type approaches, we would need substantially more 
Markov chain Monte Carlo iterations to achieve similar effective sample sizes, leading 
to an increased computational cost. 

We finally consider a couple of real data examples, which have been used earlier for 
reduced rank approaches in Gaussian process regression, of contra sting sizes. The first 
is the abalone dataset, from the UCI machine learning database dFrank & Asuncion , 
20101) . where the interest is in modeling the age of abalone, given other attributes, 



which are thought to be non-linearly related to age. The dataset consists of 4000 train- 
ing and 177 test cases. We use Euclidean distance between the attributes for our covari- 
ance function for the Gaussian process and for the gender attribute, (male/female/infant) 
is mapped to {(1, 0, 0), (0, 1, 0), (0, 0, 1)}. The other example we consider is the Sar- 
cos Robot arm, where we are interested in the torque as given by the 22 nd column 
given the other measurements in the remaining 21 columns. This dataset has 44, 484 
training and 4, 449 test cases. We once again consider Euclidean distances between 
the attributes. For each of the experiments, we use Algorithmic with target error level 
0.01. The hyperparameters for each example is chosen in similar fashion as the sim- 
ulated example. This leads to choosing a± = l,bi = 0.1 for the abalone dataset and 
a\ = 2, bi = 0.1 for the Sarcos Robot arm. The grid for 6\ in either case is 2000 
equispaced points in [0, 2]; for 9 2 , in abalone we have a% = 1, 62 = 1 while we have 
a 2 = 1, 62 = 0.75 for Sarcos Robot arm. The Gibbs sampler for the abalone data set is 
run for 10, 000 iterations with 1, 000 discarded for burn-in, while for the Sarcos Robot 
arm, it is run for 2, 000 iterations with 500 discarded for burn-in. 

The results for both these experiments, tabulated in table [4] There is improve- 
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ment in predictive accuracy when using random projections in both the examples, in 
particular for the Robot arm dataset predictive accuracy is significantly better. This 
improvement perhaps is a consequence of the fact that we get better estimation for the 
parameters when using the random projection approach. In the Sarcos Robot arm data, 
both the covariance kernel parameters are readily observed to have different posteriors 
with the random projection approach. This is a consequence of the poor behavior of the 
Markov chains for these parameters, they exhibit poor mixing. In fact this problem of 
poor mixing when approximating a stochastic process by imputed points is not unique 
to Gaussian processes, it have been observed in other contexts too, possibly due to the 
chains for t he imputed points and the unk nown parameters being highly correlated with 
each other (Golightly & Wil kinson , 2006). The random projection approach appears to 
improve this to a great extent by not considering specific imputed points. The infer- 
ence is not very sensitive to the choice of hyperparameters, with datasets of this size 
we are able to overcome prior influence if any. In particular in trial runs with smaller 
number of iterations, changing the grid for 9\ to 1000 uniformly spaced points in [0, 1] 
yielded almost similar results, random projection performing better than the knot based 
approaches. 



5 Concluding Remarks 

We have developed a broad framework for reduced rank approximations under which 
almost the entire gamut of existing approximations can be brought in. We have tried 
to stochastically find the best solution under this broad framework, thereby leading 
to gains in performance and stability over the existing approaches. Another impor- 
tant contribution has been to connect not only the machine learning and statistical 
approaches for Gaussian process approximation, but also to relate them to matrix ap- 
proximations themes - we have shown that the reduced rank Gaussian process schemes 
are effectively different flavors of approximating the covariance matrix arising therein. 
The random projection approach has been mainly studied as an approximation scheme 
in this article, it is also worthwhile considering it from a model based perspective and 
investigate the added flexibility it offers as an alternative model. 

We have not explored the performance of parallel computing techniques in this con- 
text, though we have indicated how to go about parallel versions of the algorithms at 
hand. Further blocking techniques and parallelization remains an area of future interest. 
We also plan on working out the multivariate version of random projection approxima- 
tions. In ongoing work, we explore similar approaches in other different contexts - 
a couple of examples being in the context of functional modeling, where the domain 
may be discrete and also in the case of parameter estimation for diffusion processes - 
where similar dimensionality problems are faced sometimes in terms of their discrete 
Euler approximations. In other ongoing work, we also explore the theoretical rates of 
convergence of the truncated expressions for different classes of covariance kernels and 
convergence of the associated posterior distributions of the unknown parameters. 
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Appendix 
Proof of Theorem 1 

By construction, 

K tr = UD 2 U T = UDV T VDU T = CC T 

= K^> T (B T )- 1 B- 1 <PK = K^> T (BB T )- 1 <PK 
= K^ T K^K = {<&K) T {§K<S> T )- l §K 

This shows that the reduced S VD form, K tr produced by Algorithm[T]is indeed equal 
to the random projection approximation, which is equal to a generalized projection 
matrix as explained below. 

The generalized rank m projection matrix for the projection whose range is spanned 
by the columns of an n x m matrix A, with m <n and whose nullity is the orthogonal 
complement of the range of n x m matrix B, is given b y A{B T A) -1 B T . This is a 



generalization of the standard projection matrix formula dDokovid 11991). Therefore, 
K tr = PK, where P — A$ T {<l>(A"<i> T )}~ 1 $ is the generalized projection matrix 
with range spanned by the columns of A$ T and whose nullity is the orthogonal com- 
plement of the range of <1> T . Again, by construction, range of $ T = range of Kfl and 
therefore, range of A"$ T = range of K 2 Vl — range of Kti. Finally since range of KVt 



= row- space of Q T K, the result follows by a direct application of theorem 14 in lSarlos 
(120061) . 



Proof of Theorem 2 

The Kullback Leibler divergence between two n— variate normal distributions Afo = 

N(/io, S ) & M = N(mi,Si) is given by, 

KL(JV ||M) = ~ [tr (£r%) - n - log {det (E^o)} + - Vof ^(^i - Mo) 

Inourcase,A/o = Kfuii = N(y; 0, K fJ + a 2 I) andAAi = tt rp = N(y; 0, Qf p f +a 2 I). 
Therefore AL(7r /u/; ,7r fl p) = \ [tr (E^Eq) -n-log{det (E^ 1 E ) }] , with S = 
K fJ + a 2 I and E x = Qf p f +a 2 I. We have ||E - = \\K fJ - Qfj\\ F < e. 

Break the expression for the Kullback Leibler divergence into 2 parts with the first 
part, 
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tr(E 1 1 E ) - n = tr {E x ^Eq - E x )} = Yh=i Z)"=i s »i d i« where sy,^ 



are 



(ij) & (ji)*' 1 elements of E^ , (E - Si) respectively. Then, 



^(E^Eo) - n< llSr^Uax^S^ ^ H^ILa^e (5) 

i=l j=l 

In the inequality above we use ||E^ || maa; = maxy sy and the fact that ||Eo— Si ||jr < 
e =► Yn=i J2"=i d ji < "- 2e - Now ll s i '"llmaa; < llEj" 1 !^- Since E^ 1 is symmetric 
postive definite, HEj" 1 H2 is the largest eigenvalue of E^ 1 which is equal to the inverse 
of the smallest eigenvalue of Ei. Recall that Ei = Qf^f + o 2 I and Qf 1 ^ is positive 
semi-definite and has non negative eigenvalues. Therefore all eigenvalues of Ei > a 2 , 
and using this in conjunction with inequality ||5), we have, 

tr (Er So) - N < (-J e (6) 

It remains to bound the second part of the divergence expression. We have det (E^~ Eo) 
(Il7=i ^i) I (nr=i ^1)> where A , A] are eigenvalues of E fc Si respectiv ely. Since 



Eg, Ei are symmetric, by the Hoffman- Weilandt inequality dBhatia , 1997b . there ex 



ists a permutation p such that X)"=i l^p(i) — M } — 11^0 — ^iIIf — ^ ■ Therefore 
with the same permutation p, we have for each i, ^ [1 — e )l + e ]- Trivial 

manipulation then yields, log {det (E|j~ 1 Eo) } S [n log(l — e), n log(l + e)], so that, 

-log{det(Sr 1 E )} <ne (7) 
Combining inequalities (|6]l and (0, we have, 

KL(7r/„H,7riip) < |n+ je 

which completes the proof. 

This is not an optimal bound, but serves our basic goal of showing that the Kullback 
Leibler divergence is of the same order as the error in estimation of the covariance 
matrix in terms of Frobenius norm. Additional assumptions on the eigenspace of the 
covariance matrix would yield tighter bounds. 

Example of inversion with the Woodbury matrix identity 

Either of the algorithms,[T]or[2] in this paper yield Qf^ = UD 2 U T , with U T U = I. 
We would be interested in calculating E^ 1 = (Qf 1 ^ + (J 2 !)^ 1 , in t he margin alized 



form for inference or prediction. Using the Woodbury matrix identity (IHarville , 2008) 
we have, 

Ei 1 = (j 2 1 - <j- 2 u(d- 2 + a- 2 u T u)- l u T a~ 2 

= a- 2 I - <j- a U{D- 2 + o- 2 iy x U T 

In the above D~ 2 + a~ 2 I is a diagonal matrix whose inverse can be obtained by just 
taking reciprocals of the diagonals. Thus direct matrix inversion is entirely avoided 
with the decomposition available from the algorithms. 
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Form=10 ||||f ||||2 Cond No Time 

RP 106.1377 17.6578 L0556 O06" 

PP1 107.6423 17.6776 1.2356 0.04 

PP2 106.6644 17.6778 1.2619 0.04 



For to = 25 || \\ F UH2 Cond No Time 

RP 82.1550 17.2420 L7902 022" 

PP1 91.1016 17.5460 230.236 0.18 

PP2 85.6616 17.3800 13.8971 0.21 



For to = 50 \\\\f HH2 Cond No Time 

RP 50.5356 14.2998 2.9338 OZT 

PP1 79.1030 17.0172 2803.5 0.24 

PP2 69.5681 15.6815 876.23 0.25 



For to =100 |||| F |||| 2 Cond No Time 

RP (Algol) 6^6119 2T8383 20.6504 O40" 

PP1 39.9642 13.1961 1.3815 xlO 6 0.31 

PP2 10.1639 6.3082 1792.1 0.36 



Table 1: Comparative performance of the approximations in terms of matrix error 
norms, with the random projection approach based on Algorithm[T] 





PP1 


PP2 


RP 


n = 100, e = 0.1, optimal m = 5 


Required Rank 


17 


9 


7 


Cond No 


298.10 


54.59 


20.08 


Time 


0.03 


0.04 


0.07 


n= 1000, e = 0.01, optimal m = 69 


Required Rank 


213 


97 


78 


Cond No 


2.30 x 10' 


2164.6 


473.43 


Time 


12.1 


11.5 


36.2 


n - 10000, e - 0.01, optimal m - 137 


Required Rank 


1757 


793 


174 


Cond No 


3.19 x 10 iy 


2.30 x 10 y 


1012.3 


Time 


335 


286 


214 



Table 2: Comparison of the ranks required to achieve specific target errors by the dif- 
ferent algorithms, with random projection based on Algorithmic] 



21 





PP1 


PP2 


RP 


e = 0.1, smooth 


MSPE 


11.985 


8.447 


3.643 


Avg Required Rank 


1715.6 


453.8 


117.2 




95% Interval Required Rank 


[1331,2542] 


[377,525] 


[97,141] 




Posterior Mean, 6\ 


0.09 


0.10 


0.06 




95% credible interval, 0i 


[0.05,0.14] 


[0.05,0.15] 


[0.04,0.08] 




ESS, 0i 


496 


870 


1949 




Posterior Mean, 02 


0.91 


1.15 


1.25 




95% credible interval, 2 


[0.58,1.58] 


[0.85,1.43] 


[1.09,1.46] 




ESS, 2 


2941 


3922 


4518 




Avg ESS, Predicted Values 


2190 


3131 


5377 




Time 


39761 


29355 


32365 


e = 0.01, wavy 


MSPE 


10.114 


6.891 


2.265 


Avg Required Rank 


3927.1 


941.5 


129.7 




95% Interval Required Rank 


[2351,5739] 


[868,1165] 


[103,159] 




Posterior Mean, 0i 


0.07 


0.08 


0.13 




95% credible interval, 0i 


[0.01,0.14] 


[0.02,0.15] 


[0.09,0.17] 




ESS, 0i 


574 


631 


1918 




Posterior Mean, 2 


0.83 


0.85 


0.79 




95% credible interval, 2 


[0.21,1.74] 


[0.40.1.63] 


[0.45,1.29] 




ESS, 2 


3679 


4819 


5002 




Avg ESS, Predicted Values 


2875 


3781 


5769 




Time 


78812 


47642 


33799 


e = 0.01, very wavy 


MSPE 


17.41 


13.82 


6.93 


Avg Required Rank 


4758.5 


1412.5 


404.5 




95% Interval Required Rank 


[2871,6781] 


[1247,1672] 


[312,475] 




Posterior Mean, 0i 


0.11 


0.09 


0.05 




95% credible interval, 0i 


[0.04,0.17] 


[0.05,0.13] 


[0.03,0.08] 




ESS, 0i 


741 


747 


1049 




Posterior Mean, 02 


1.27 


1.18 


1.19 




95% credible interval, 2 


[1.08,1.43] 


[1.12,1.41] 


[1.15,1.34] 




ESS, 2 


1521 


2410 


2651 




Avg ESS, Predicted Values 


1263 


1415 


2422 




Time 


89715 


57812 


47261 



Table 3: Simulated data sets with the target error algorithm for the three different 
simulations. Different algorithms compared in terms of preditive MSE and various 
posterior summaries for the unknown parameters. ESS stands for effective sample 
size. 
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PP1 


PP2 


RP 


Abalone Dataset 


MSPE 


1.785 


1.517 


1.182 


Avg Reqd Rank 


417.6 


328.8 


57.2 




95% Interval Required Rank 


[213,750] 


[207,651] 


[43,71] 




Posterior Mean, 0\ 


0.212 


0.187 


0.149 




95% credible interval, 1 


[0.112,0.317] 


[0.109,0.296] 


[0.105,0.207] 




ESS, 6>i 


516 


715 


1543 




Posterior Mean, 2 


0.981 


1.014 


1.105 




95% credible interval, 2 


[0.351,1.717] 


[0.447,1.863] 


[0.638,1.759] 




ESS, 2 


1352 


1427 


1599 




Time 


19468 


21355 


15423 


Sarcos Robot Arm 


MSPE 


0.5168 


0.2357 


0.0471 


Avg Reqd Rank 


4195 


2031 


376 




95% Interval Required Rank 


[3301,4985] 


[1673,2553] 


[309,459] 




Posterior Mean, 0i 


0.496 


0.352 


0.105 




95% credible interval, 6\ 


[0.087,0.993] 


[0.085,0.761] 


[0.042,0.289] 




ESS, 0i 


85 


119 


147 




Posterior Mean, 02 


1.411 


1.315 


1.099 




95% credible interval, 2 


[1.114,1.857] 


[1.065,1.701] 


[1.002,1.203] 




ESS, 2 


145 


132 


227 




lime 


57213 


53929 


20869 



Table 4: Comparison of the different algorithms based on their performance in the 
experimental data sets in terms of preditive MSE and various posterior summaries for 
the unknown parameters. ESS stands for effective sample size. 
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